We study the degree to which a character string, Q, leaks details about itself any time it engages in
comparison protocols with strings provided by a querier, Bob, even if those protocols are cryptographically
guaranteed to produce no additional information other than the scores that assess the degree to
which Q matches strings offered by Bob. We show that such scenarios allow Bob to play variants of the
game of Mastermind with Q so as to learn the complete identity of Q. We show that there are a number
of efficient implementations for Bob to employ in these Mastermind attacks, depending on knowledge
he has about the structure of Q, which show how quickly he can determine Q. Indeed, we show that
Bob can discover Q using a number of rounds of test comparisons that is much smaller than the length
of Q, under reasonable assumptions regarding the types of scores that are returned by the cryptographic
protocols and whether he can use knowledge about the distribution that Q comes from. We also provide
the results of a case study we performed on a database of mitochondrial DNA, showing the vulnerability
of existing real-world DNA data to the Mastermind attack.

Keywords: character strings, Mastermind, mitochondrial DNA.
1 Introduction

Mastermind [10, 25] is a game played between two players, a codemaker and a codebreaker, using colored
pegs. (See Figure 1.)

Viewed mathematically, Mastermind is abstracted as a game where the codemaker selects a plaintext
string Q, of length N, whose elements are selected from an alphabet of size K. For consistency with
the board game, the members of this alphabet are often referred to as "colors." The codemaker and
codebreaker both know the values of N and K, and play consists of the codebreaker repeatedly making guesses,
V_1, V_2, ..., about the identity of Q. For each guess, V_i, the codemaker provides a score on how well V_i
matches Q. In double-count Mastermind, which is the standard version based on the board game, this score
consists of a pair of numbers:

• A black count, b(Q, V_i), which is the number of elements in V_i and Q that match in both value and
location. That is,

$$b(Q, V_i) = |\{\, j : V_i[j] = Q[j] \,\}|.$$
Figure 1: The Mastermind game. The four large pegs in the middle are used for guessing. The four
smaller peg locations on the right are used to score each guess, with black-peg and white-peg scores.
And the two pegs on the left are used to keep score across multiple games. (This image is adapted from
http://en.wikipedia.org/wiki/File:Mastermind.jpg, by User:ZeroOne, under the Creative Commons
Attribution ShareAlike 2.0 License.)



• A white count, w(Q, V_i), which is the number of elements in V_i that appear in Q but in different
locations than their locations in V_i. That is, letting π denote an arbitrary permutation,

$$w(Q, V_i) = \max_{\pi} |\{\, j : V_i[\pi(j)] = Q[j] \,\}| - b(Q, V_i).$$

In single-count Mastermind, which has been less studied, the codebreaker is given only the black count,
b(Q, V_i), for each guess, V_i. (Note that it is impossible to solve the problem given only white-count scores.)
The goal is for the codebreaker to discover Q using a small number of guesses.
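
To make the two scores concrete, here is a minimal Python sketch that computes the black and white counts for a pair of equal-length strings. The function names are illustrative and do not come from the paper. The white count relies on the observation that the maximum over permutations equals the multiset overlap of the two strings, that is, the sum over colors of the smaller of the two occurrence counts.

```python
from collections import Counter

def black_count(q, v):
    """Number of positions where q and v hold the same color."""
    return sum(1 for a, b in zip(q, v) if a == b)

def white_count(q, v):
    """Colors of v that also occur in q, but at a different position."""
    # Counter & Counter keeps, per color, the smaller of the two counts.
    # That total is the best match count over all permutations of v.
    overlap = sum((Counter(q) & Counter(v)).values())
    return overlap - black_count(q, v)

if __name__ == "__main__":
    secret, guess = "1122", "1212"
    print("black:", black_count(secret, guess))
    print("white:", white_count(secret, guess))
```

In the single-count setting only `black_count` would be revealed to the codebreaker.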

1.1 Previous Related Work 

The original Mastermind game was invented in 1970 by Meirowitz, as a board game having holes for vectors
of length N = 4 and K = 6 colored pegs. Knuth [25] subsequently showed that this instance of the
Mastermind game can be solved in five guesses or fewer. Chvátal [10] studied the combinatorics of general
Mastermind, showing that it can be solved in polynomial time, in the K > N case, using $2N\lceil \log K \rceil + 4N$
guesses, and Chen et al. [9] showed how this bound can be improved, in this case, to
$2N\lceil \log N \rceil + 2N + \lceil K/N \rceil + 2$ guesses. Stuckman and Zhang [33] showed that it is NP-complete to determine if a sequence of
guesses and responses in general double-count Mastermind is satisfiable. Goodrich [20] shows that
single-count (black-peg) Mastermind satisfiability is NP-complete and that a specific vector Q can be guessed using
a sequence of single-count (black-peg) query vectors of length $N\lceil \log K \rceil + \lceil (2 - 1/K)N \rceil + K$.
Several researchers have explored privacy-preserving data querying methods that can be applied to
character strings (e.g., see [2, 15, 16]). In particular, Atallah et al. [2] and Atallah and Li [3] studied
privacy-preserving protocols for edit-distance string comparisons, such as in the longest common subsequence (LCS)
problem [21, 22, 36], where each party learns the score for the comparison, but neither learns the contents
of the string of the other party. Such comparisons are common in DNA sequence alignment comparisons,
for example. Troncoso-Pastoriza et al. [35] described a privacy-preserving protocol for searching for a
certain regular-expression pattern in a DNA sequence. At last year's Oakland conference, Jha et al. [23] gave
privacy-preserving protocols for computing edit distance similarity scores between two genomic sequences,
improving the privacy-preserving edit distance algorithm of Szajda et al. [34]. Single-count matching
between two strings can be done in a privacy-preserving manner, as well, using privacy-preserving set
intersection, e.g., using the method of Freedman et al. [16], Vaidya and Clifton [37], or Sang and Shen [31, 32].
The string matching problem can also be done using privacy-preserving dot product computations [?] or
even general multi-party computation protocols (e.g., see [12, 18, 39]) or systems [?]. Jiang et al. [24]
study a secure multiparty method for comparing a genomic sequence against every sequence in a genomic
database, providing a score indicating the match strength between the query sequence and each sequence in
the database.

In terms of the framework of this paper, the closest previous work is that of Du and Atallah [14], who
studied a privacy-preserving protocol for querying a string Q in a database of strings, D, where comparisons
are based on approximate matching (but not sequence-alignment). Their protocols assume that the parties 
are honest-but-curious, however, so that, for instance, the database owner cannot introduce fake strings in 
his database whose intent is to discover the identity of the query string, Q. The attack model we explore 
in this paper, on the other hand, allows for "cheating" in the comparison protocol, so that D can introduce 
strings whose sole purpose is to help him discover the identity of Q. 

1.2 Attack Scenarios 

In this paper we study the Mastermind attack on string data, which is a way that a genomic querier, Bob,
can "play" a type of Mastermind game with an unknown string, Q, for which Q's owner, Alice, thinks that
she is comparing with Bob in a privacy-preserving manner, but instead Bob is discovering the full identity
of Q.

The attack scenario is that Alice repeatedly participates in privacy-preserving comparisons of Q to
iteratively compare Q with strings provided by Bob. All that is learned from each comparison is the score measuring
the similarity of the two strings (Q and a string V_i provided by Bob), with the score for each string
comparison being revealed to Bob (and possibly also Alice) before the next comparison begins. Bob's goal is to
learn the complete identity of Q with a reasonably small number of comparisons.

We distinguish two versions of this attack scenario. In the first scenario, the comparison between Q and
each string V_i provided by Bob is scored according to the single-count (black-peg) straight-match score,

$$b(Q, V_i) = |\{\, j : V_i[j] = Q[j] \,\}|.$$

In our second scenario, which is more common in genomic databases, the comparison between Q and each
V_i provided by Bob is scored according to a sequence-alignment score,

$$a(Q, V_i) = |\{\, (j,k) \in X : V_i[j] = Q[k] \,\}|,$$

where X is an ordered index set of pairs of integers so that if (j, k) appears before (l, m) in X, then j < l
and k < m, and X is chosen to maximize this count. This is also known as the longest common subsequence
(LCS) [21, 22, 36] score between Q and V_i. (See Figure 2.) Incidentally, as we observe below, Levenshtein
edit distance scores are strongly related to the LCS score, and our attack scenarios should be able to be
translated to this other measure, as well.

(a)  ACGGATGCCTT          (b)  ACGGATGCCTT
     ATGGCAGCATC               AGGCATTGCAT

Figure 2: Illustrating two types of matches between two DNA sequences. (a) A single-count (black-peg)
straight-match. Note that the second "A" in the bottom string is not matched, since it doesn't line up exactly
with the second "A" in the top string. (b) A sequence-alignment match. In going from the top string to the
bottom string, the first "C" in the top string corresponds to a deletion event, the first "C" in the bottom string
corresponds to an insertion event, and the penultimate characters in each string correspond to a substitution
event.
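
The two score functions can be sketched in a few lines of Python. This is only an illustration of what a comparison protocol would reveal, not the privacy-preserving protocol itself; the names `straight_match` and `lcs_score` are assumptions made for this example. The sequence-alignment score is computed with the textbook dynamic program for the longest common subsequence.

```python
def straight_match(q, v):
    """Single-count (black-peg) score b(Q, V): equal characters at equal positions."""
    return sum(1 for a, b in zip(q, v) if a == b)

def lcs_score(q, v):
    """Sequence-alignment score a(Q, V): length of a longest common subsequence."""
    prev = [0] * (len(v) + 1)              # LCS lengths for the previous row of q
    for a in q:
        cur = [0]
        for j, b in enumerate(v, start=1):
            if a == b:
                cur.append(prev[j - 1] + 1)            # extend a common subsequence
            else:
                cur.append(max(prev[j], cur[j - 1]))   # skip a character of q or of v
        prev = cur
    return prev[-1]

if __name__ == "__main__":
    top = "ACGGATGCCTT"
    print("b, pair (a):", straight_match(top, "ATGGCAGCATC"))
    print("a, pair (b):", lcs_score(top, "AGGCATTGCAT"))
```

For the pair of strings in Figure 2(a), `straight_match` counts the six aligned positions that agree.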

There are a number of motivating usage environments that could be susceptible to Mastermind attacks. 
For example, Bob could be a genomic database owner, storing genomic strings for a number of individuals,
and Alice could be a database user who is searching Bob's database to find the closest match to a string Q 
of interest. Bob could, for instance, be the owner of a database of DNA from every male attending a certain 
university and Alice could be an FBI agent searching through that database for a match with DNA evidence 
gathered after a sexual assault. Both parties in this example are likely to be under legal restrictions not to 
reveal the complete identity of their strings unless there is a match. In another example, Alice could be the 
owner of a database of genomic sequences and Bob could be an attacker trying to learn the identity of a 
string Q in Alice's database, e.g., one that Bob can identify only by an anonymized index, j. In this case, Bob
repeatedly does queries with each of his strings, V_i, indexing into Alice's database using the name "j" to
locate Q and get Alice to do a privacy-preserving comparison of Q with V_i. Bob could, for instance, be an employer
trying to learn the genomic sequence of a prospective employee, Charlie, by querying a university DNA 
sequence database owned by Alice, which he could query simply knowing the index of Charlie's DNA in 
Alice's database (e.g., Bob might be able to infer this index from Charlie's student number). In every case,
Bob gets to ask Alice to compare her string, Q, to each of his query strings, V_i, in a privacy-preserving
manner until these comparisons have leaked enough information that he can easily infer the identity of Q. 

1.3 Our Results 

In this paper we study various aspects of the Mastermind attack, deriving the following results. 

• We show that the problem of determining whether a sequence of Mastermind responses has a valid 
solution is NP-complete even if each response is a sequence-alignment response. 

At first, this might seem to provide some security for the privacy of the unknown string, Q, for it implies a 
degree of intractability to the problem of learning a query string Q just from Mastermind responses involving 
Q. Unfortunately, as was learned with Knapsack cryptosystems [28], having the security of a system be 
based on the difficulty of solving an NP-complete problem is no guarantee that it is safe in practice. Indeed, 
such is the case for the security of genomic sequences being susceptible to the Mastermind attack. We show 
that character strings can be discovered by a surprisingly short sequence of guesses. In particular, we also
provide the following results: 

• We show that an arbitrary query string, Q, of length N from an alphabet of size K, can be discovered
with (N + 2)K queries, each of which reports the result of a sequence-alignment (LCS) test. Such
queries are common in genomic applications. We also show that this bound can be further improved
if the distribution of characters in the alphabet follows Zipf's Law [27].

• We show how a Mastermind attacker can take advantage of known distributional information for
genomic data. Armed with distributional knowledge about a query string, Q, with respect to a reference
string, R, such as the Revised Cambridge Reference Sequence, rCRS (GenBank accession number:
AC_000021), the Mastermind attacker can discover Q much more quickly than in the general cases, using
either single-count or sequence-alignment responses.

• We provide experimental analysis of the distribution-based Mastermind attack for genomic data,
showing that, for a case study involving mitochondrial DNA (mtDNA), with either single-count responses
or sequence-alignment responses, the attack works surprisingly well. Given the relative abundance of
mtDNA data, and its ethnic sensitivity, we focus our experiments on 1000 human mtDNA sequences,
showing that most can be discovered with a Mastermind attack of just a few hundred guesses, even
though mtDNA sequences are typically over 16,500 bp long. Given that current mtDNA databases
already have thousands of members (e.g., see [5]), this experimental analysis shows that it would
be relatively easy for an attacker, Bob, to interleave an undetected Mastermind attack with
privacy-preserving responses to actual sequences.

We conclude by discussing some of the issues that would have to be addressed in order to defeat
Mastermind attacks on genomic data, as well as some possible directions for future research.

2 Alternative Sequence Comparison Scores 

Throughout this paper, we assume that the attacker, Bob, can learn the value of either a straight-match score,
b(Q, V_i), or a sequence-alignment score, a(Q, V_i), between the unknown string, Q, and each of his given
strings, V_i. These are not the only types of scores of interest with respect to genomic data, however. So,
before we discuss the privacy risks of genomic data from Mastermind attacks that use the b or a functions as
scores, let us discuss two other kinds of score functions and how they could alternatively be used for similar
attacks.

There are a number of score functions that measure the similarity between two strings. We review two 
here, including how they can be reduced to similarity measures using the functions b or a, for comparing 
two strings, Q and V. 

• Hamming distance: the Hamming distance, H(Q, V), between Q and V, is given by

$$H(Q, V) = |\{\, j : V[j] \neq Q[j] \,\}|.$$

That is, the two strings Q and V are aligned in a way that disallows insertions and deletions, and
a score is computed based on the number of substitutions needed to convert Q to V. Note that,
given a Hamming distance score, H(Q, V), we can compute a straight-match score as
$b(Q, V) = |Q| - H(Q, V)$.

• Levenshtein distance: the Levenshtein distance, L(Q, V), between Q and V, which is a kind of
edit distance, is the minimum number of insertions, deletions, and substitutions needed to convert Q
into V (or vice versa). Note that, given a Levenshtein distance score, L(Q, V), in which each substitution
is charged as a deletion followed by an insertion, we can compute a sequence-alignment score as

$$a(Q, V) = \frac{|Q| + |V| - L(Q, V)}{2}.$$

Thus, the Mastermind attacks we mention in this paper apply equally well to systems that support string
comparisons using Hamming distance or Levenshtein distance.
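
The two conversions can be checked with a short Python sketch. It rests on a stated assumption: the edit distance used here allows only insertions and deletions, so a substitution costs 2 (one deletion plus one insertion). Under that cost model the relation between edit distance and LCS length is exact; with unit-cost substitutions it need not hold. All function names are illustrative.

```python
def hamming_distance(q, v):
    """Number of positions at which two equal-length strings differ."""
    if len(q) != len(v):
        raise ValueError("Hamming distance needs equal-length strings")
    return sum(1 for a, b in zip(q, v) if a != b)

def straight_match_from_hamming(length, h):
    """b(Q, V) = |Q| - H(Q, V)."""
    return length - h

def indel_distance(q, v):
    """Edit distance with insertions and deletions only (assumed cost model)."""
    prev = list(range(len(v) + 1))          # distance from the empty prefix of q
    for i, a in enumerate(q, start=1):
        cur = [i]
        for j, b in enumerate(v, start=1):
            if a == b:
                cur.append(prev[j - 1])                    # characters align for free
            else:
                cur.append(1 + min(prev[j], cur[j - 1]))   # delete from q or insert from v
        prev = cur
    return prev[-1]

def alignment_from_distance(len_q, len_v, d):
    """a(Q, V) = (|Q| + |V| - d) / 2 for an insertion/deletion distance d."""
    return (len_q + len_v - d) // 2

if __name__ == "__main__":
    q, v = "ACGGATGCCTT", "ATGGCAGCATC"
    print(straight_match_from_hamming(len(q), hamming_distance(q, v)))
    print(alignment_from_distance(len(q), len(v), indel_distance(q, v)))
```

An attacker who only sees distances can therefore recover the b or a score with one subtraction, provided the string lengths are known.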



3 NP-Completeness of Sequence-Alignment Mastermind Satisfiability

As mentioned above, Stuckman and Zhang [33] show that double-count Mastermind satisfiability is
NP-complete and Goodrich [20] shows that single-count (black-peg) Mastermind satisfiability is also
NP-complete (which applies equally well for Hamming distance).

In the Sequence-Alignment Mastermind Satisfiability problem, we are given a collection of Mastermind
queries, V_1, V_2, ..., V_n, and the responses, a(Q, V_1), a(Q, V_2), ..., a(Q, V_n), each of which is said to
report the sequence-alignment (LCS) score between each V_i and an unknown vector, Q. We are asked to
determine if there indeed exists a vector Q that satisfies all of these responses.

Theorem 1: Sequence-Alignment Mastermind Satisfiability is NP-complete. 

Proof: Our proof is an adaptation of the NP-completeness proof of Goodrich [20] showing that
single-count (black-peg) Mastermind Satisfiability is NP-complete. It is easy to see that Sequence-Alignment
Mastermind Satisfiability is in NP. For example, we could nondeterministically guess a vector Q and then
test in polynomial time whether it satisfies all the responses, a(Q, V_1), a(Q, V_2), ..., a(Q, V_n).
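
The polynomial-time test in the membership argument can be sketched as follows. This is a hedged illustration, not part of the original proof: `satisfies_all` is a hypothetical name, and the check simply recomputes each LCS score with the standard quadratic dynamic program and compares it with the claimed response.

```python
def lcs_length(q, v):
    """Length of a longest common subsequence of q and v."""
    prev = [0] * (len(v) + 1)
    for a in q:
        cur = [0]
        for j, b in enumerate(v, start=1):
            cur.append(prev[j - 1] + 1 if a == b else max(prev[j], cur[j - 1]))
        prev = cur
    return prev[-1]

def satisfies_all(candidate, queries, responses):
    """True if the candidate reproduces every sequence-alignment response."""
    return all(lcs_length(candidate, v) == r
               for v, r in zip(queries, responses))

if __name__ == "__main__":
    queries = ["AAA", "BBB", "AB"]
    responses = [2, 1, 2]
    print(satisfies_all("ABA", queries, responses))
    print(satisfies_all("BBB", queries, responses))
```

Each check costs time proportional to the product of the two string lengths, so verifying a guessed Q against n responses is polynomial in the input size.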

To prove that Sequence-Alignment Mastermind Satisfiability is NP-hard, we provide a reduction from
3-Dimensional Matching (3DM), which is a well-known NP-complete problem (e.g., see [17]). In the 3DM
problem, we are given three sets, $X = \{x_1, \ldots, x_n\}$, $Y = \{y_1, \ldots, y_n\}$, and $Z = \{z_1, \ldots, z_n\}$, of n
elements each. In addition, we are given a set T of m triples, $\{(x_{i_1}, y_{j_1}, z_{k_1}), \ldots, (x_{i_m}, y_{j_m}, z_{k_m})\}$, whose
elements are respectively taken from the three sets, X, Y, and Z. The problem is to determine if there is a
subset of triples such that each element in X, Y, and Z appears in exactly one triple in this subset.

Suppose, then, that we are given an instance of the 3DM problem, as described above. We consider the
unknown vector, Q, to consist of the following vector of variables:

$$(X_1, \ldots, X_{2n};\ Y_1, \ldots, Y_{2n};\ Z_1, \ldots, Z_{2n};\ T_1, \ldots, T_{2m-1}),$$

where the semicolons are used for the sake of notation to separate the four sections in the unknown vector,
Q. We perform our reduction by constructing a sequence of guess vectors, $V_0, V_1, \ldots, V_n$, together with
their sequence-alignment responses, $a(Q, V_0), a(Q, V_1), \ldots, a(Q, V_n)$, so that there is a satisfying vector
Q for these responses if and only if there is a solution to the given instance of the 3DM problem.

Our construction begins by setting the number of colors, K, to be m + 2. Intuitively, there is a color
associated with each triple in T, plus a "null" color, φ, which is guaranteed to appear nowhere in our
unknown vector, Q, and a separator color, μ, which occurs in every other (even-indexed) position of Q. We
begin our sequence of queries with four special "enforcer" queries. The first two of these are

$$V_0 = (\phi, \ldots, \phi;\ \phi, \ldots, \phi;\ \phi, \ldots, \phi;\ \phi, \ldots, \phi),$$

which has response $a(Q, V_0) = 0$, and

$$V_1 = (\mu, \ldots, \mu;\ \mu, \ldots, \mu;\ \mu, \ldots, \mu;\ \mu, \ldots, \mu),$$

which has response $a(Q, V_1) = 3n + m - 1$. Intuitively, V_0 enforces the fact that the null color, φ, appears
nowhere in the unknown vector, and V_1 enforces the fact that the separator color, μ, appears exactly often
enough to separate every other (non-μ) character in the unknown vector. So as to better understand the
characteristics of the other queries, let us set $h = 3n + m - 1$, the number of μ colors in our unknown vector
Q. We then define two additional enforcer queries,

$$V_2 = (\phi, \mu, \ldots, \phi, \mu;\ \phi, \mu, \ldots, \phi, \mu;\ \phi, \mu, \ldots, \phi, \mu;\ 1, \mu, 1, \mu, \ldots, \mu, 1),$$

which has response $a(Q, V_2) = h + n$, and

$$V_3 = (\phi, \mu, \ldots, \phi, \mu;\ \phi, \mu, \ldots, \phi, \mu;\ \phi, \mu, \ldots, \phi, \mu;\ 0, \mu, 0, \mu, \ldots, \mu, 0),$$

which has response $a(Q, V_3) = h + m - n$. Intuitively, V_2 enforces a counting rule that exactly n of the
T_j's will be set to 1, and V_3 enforces a counting rule that the remaining m − n of the T_j's will be set to 0.
For each triple, $T_s = (x_{i_s}, y_{j_s}, z_{k_s})$, we construct three query vectors, as follows.

$$V_{s,1} = (\phi, \mu, \ldots, \mu, \phi, \mu, s, \mu, \phi, \mu, \ldots, \phi, \mu;\ \phi, \mu, \ldots, \mu, \phi, \mu;\ \phi, \mu, \ldots, \mu, \phi, \mu;\ \phi, \mu, \ldots, \mu, \phi, \mu, 0, \mu, \phi, \mu, \ldots, \mu, \phi),$$

where the s is in position $2i_s - 1$ in the first group and the 0 is in position $2s - 1$ in the fourth group. This
vector has response $a(Q, V_{s,1}) = h + 1$.

$$V_{s,2} = (\phi, \mu, \ldots, \mu, \phi, \mu;\ \phi, \mu, \ldots, \mu, \phi, \mu, s, \mu, \phi, \mu, \ldots, \mu, \phi, \mu;\ \phi, \mu, \ldots, \mu, \phi, \mu;\ \phi, \mu, \ldots, \mu, \phi, \mu, 0, \mu, \phi, \mu, \ldots, \mu, \phi),$$

where the s is in position $2j_s - 1$ in the second group and the 0 is in position $2s - 1$ in the fourth group.
This vector has response $a(Q, V_{s,2}) = h + 1$.

$$V_{s,3} = (\phi, \mu, \ldots, \mu, \phi, \mu;\ \phi, \mu, \ldots, \mu, \phi, \mu;\ \phi, \mu, \ldots, \mu, \phi, \mu, s, \mu, \phi, \mu, \ldots, \mu, \phi, \mu;\ \phi, \mu, \ldots, \mu, \phi, \mu, 0, \mu, \phi, \mu, \ldots, \mu, \phi),$$

where the s is in position $2k_s - 1$ in the third group and the 0 is in position $2s - 1$ in the fourth group. This
vector has response $a(Q, V_{s,3}) = h + 1$. Intuitively, these three responses collectively form a "chooser"
gadget, where we will either have $T_{2s-1} = 0$ or the three variables $X_{2i_s-1}$, $Y_{2j_s-1}$, and $Z_{2k_s-1}$ will each
be set to have color s (and $T_{2s-1} = 1$). Moreover, note that there are m odd-index positions in the T, and
each of them has to match either a 0 or 1 color.

This reduction can clearly be done in polynomial time. So all that remains is for us to show that it works.
Suppose, then, that there is a possible solution to the given instance of 3DM. Then for each chosen triple,
$T_s = (x_{i_s}, y_{j_s}, z_{k_s})$, we can assign colors $T_{2s-1} = 1$, $X_{2i_s-1} = s$, $Y_{2j_s-1} = s$, and $Z_{2k_s-1} = s$, which will
satisfy each of the $V_{s,1}$, $V_{s,2}$, and $V_{s,3}$ vector responses for this value of s. Likewise, setting $T_{2s-1} = 0$ will
satisfy each of the $V_{s,1}$, $V_{s,2}$, and $V_{s,3}$ vector responses for a triple $T_s$ that is not chosen. Finally, given
that there are n chosen triples, we will satisfy the four preliminary vector responses as well.

Suppose, alternatively, that we have a vector Q that satisfies all our vector responses. We know that
each $X_i$, $Y_j$, and $Z_k$ must be assigned a color other than φ. Moreover, every even-indexed position in Q
must be assigned the color μ and every odd-indexed position must be a color other than μ, because there
are exactly $h = 3n + m - 1$ instances of μ in Q and we have introduced a query that enforces the fact that
there is exactly one non-μ color between every consecutive pair of μ-colored positions. Since there are only
m + 2 colors, this implies each odd-indexed position $X_{2i-1}$, $Y_{2j-1}$, and $Z_{2k-1}$ must be assigned a color
corresponding to a triple number, s, that is, it is not assigned φ or μ. If the corresponding $T_{2s-1} = 1$, then
in order to have satisfied the vectors $V_{s,1}$, $V_{s,2}$, and $V_{s,3}$, we must have set $X_{2i_s-1} = s$, $Y_{2j_s-1} = s$, and
$Z_{2k_s-1} = s$, which implies we can include the triple $(x_{i_s}, y_{j_s}, z_{k_s})$ in our matching. If $T_{2s-1} = 0$, then we
do not include this triple in our matching. By the vector responses V_2 and V_3, we know that the number of
triples chosen in this way is exactly n. Thus, we have found a valid 3-dimensional matching. ■

Thus, it is extremely unlikely that we will be able to find a polynomial-time algorithm that can always
satisfy arbitrary Mastermind sequence-alignment query strings, or even single-count queries [20].
Unfortunately, this is not the same as a guarantee of security for the kinds of query strings that would result from an
interaction between a Mastermind attacker, Bob, and a character string owner, Alice, where Bob is trying to
learn Alice's string, Q, through a sequence of privacy-preserving string comparisons. For we show, in the
sections that follow, that such query strings, Q, can be discovered fairly efficiently using the Mastermind
attack.



4 The Mastermind Attack for Sequence-Alignment Queries 

Recall that in a sequence-alignment query we wish to compare two strings Q and V, where the score for
a match is the length of the longest common subsequence (LCS) [21, 22, 36] between Q and V. Several
researchers have studied this problem and have come up with privacy-preserving protocols to determine
such scores (e.g., see [?]). In this section, we show that performing such a series of sequence-alignment
queries with Bob is susceptible to a type of Mastermind attack of its own.

Suppose we are given an unknown string Q of length N over an alphabet of size K, the members of
which we call "colors." Suppose further that we are going to engage in a protocol with Bob to test Q against
strings provided by Bob, where each test returns the length of a longest common subsequence between
Q and one of Bob's strings. That is, we score matches using the sequence-alignment scoring function,
a(Q, V), for a guess vector V, which is the length of a longest common subsequence between V and Q. We
are interested in this section in studying an efficient scheme for Bob to discover Q using this query scheme.

A Mastermind-attack algorithm for Bob begins as follows: 

• Bob begins by guessing K vectors, V_1, V_2, ..., V_K, with each vector V_i consisting of N elements, all of
the same color, i.

The sequence-alignment score for each of the initial guesses will tell Bob the cardinality, c_i, of each
color i in Q. Let us now imagine that we reorder the colors so that they are listed 1 to K in nondecreasing
order of how often they each appear in Q. Thus, color 1 is now the least frequent color in Q and K is
the most frequent color. Our algorithm continues by incrementally building up a vector W, such that W
either completely matches all its characters with Q (in the specified order) or it misses by just one character.
Initially, we set W to be a vector consisting of exactly c_1 elements of color 1, so that if we were to guess W,
then we would get a score of a(Q, W) = c_1. We allow indexing and insertion into W so that we can add a
character before the ith element in W, for i = 1 to |W| + 1 (with an insertion "before" position |W| + 1 taken
to mean an insertion just after position |W|, the last position in W). Our algorithm for Bob's Mastermind
attack continues as shown in Figure 3.

for k = 2 to K do {take each color in turn}
    Set i = 1 {position in W where to insert items}
    Set j = 0 {count of number of items of color k found}
    while j < c_k do {find the places for color k}
        Add a color k item just before the ith item in W.
        Make a guess for W to learn the value of a(Q, W).
        if a(Q, W) = |W| then {all of W matches}
            Increment i and j.
        else {there's one too many of color k before i}
            Remove the color k item before i.
            Increment i.
        end if
    end while
end for

Figure 3: The sequence-alignment learning algorithm.

Note inductively that, at the end of each iteration of the while-loop, every character in W matches
in Q, that is, a(Q, W) = |W|. Thus, any time the if-statement finds that a(Q, W) ≠ |W|, then we have just
added an item of color k in a place where it cannot match any item without causing a previously-matched
neighboring item to mis-match what it previously could match. Therefore, in each iteration of the for-loop,
the algorithm correctly finds all the places where items of color k fit with respect to items of colors 1 to
k - 1. So, when the algorithm completes, we have W = Q; that is, we have learned Q.
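
The learning algorithm can be simulated end to end with the Python sketch below. It is a minimal illustration under stated assumptions: the comparison protocol is replaced by a plain function `oracle` that returns the LCS length, and the names `mastermind_lcs_attack` and `lcs_length` are invented for this example. Positions are 0-based here, whereas the pseudocode counts from 1.

```python
def lcs_length(q, v):
    """Length of a longest common subsequence of q and v."""
    prev = [0] * (len(v) + 1)
    for a in q:
        cur = [0]
        for j, b in enumerate(v, start=1):
            cur.append(prev[j - 1] + 1 if a == b else max(prev[j], cur[j - 1]))
        prev = cur
    return prev[-1]

def mastermind_lcs_attack(oracle, n, alphabet):
    """Recover a hidden length-n string from LCS scores only.

    oracle(guess) stands in for one comparison and returns a(Q, guess).
    Returns the recovered string and the number of queries used.
    """
    queries = 0
    counts = {}
    for color in alphabet:                  # K monochrome guesses give the counts
        counts[color] = oracle([color] * n)
        queries += 1
    order = sorted(alphabet, key=lambda c: counts[c])    # least frequent first
    w = [order[0]] * counts[order[0]]
    for color in order[1:]:
        i, found = 0, 0
        while found < counts[color]:
            if i > len(w):                  # cannot happen with an honest oracle
                raise RuntimeError("oracle responses are inconsistent")
            w.insert(i, color)              # tentatively place the color at position i
            queries += 1
            if oracle(w) == len(w):         # all of W still matches: keep it
                found += 1
            else:                           # one too many of this color here
                del w[i]
            i += 1
    return "".join(w), queries

if __name__ == "__main__":
    secret = "ACGGATGCCTT"
    recovered, used = mastermind_lcs_attack(
        lambda guess: lcs_length(secret, guess), len(secret), "ACGT")
    print(recovered == secret, used)
```

The simulation never reads the secret directly; it only sees the scores, which is exactly the information the attacker is assumed to obtain.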

Consider now the analysis of this algorithm. Note that in each iteration of the while-loop, we increment
i, our index into W, and that at the end of the while loop the length of W is $c_1 + c_2 + \cdots + c_k$, where k is
the index of the for-loop. Thus, the total number of queries made is at most

$$K + \sum_{i=1}^{K} \sum_{j=1}^{i} c_j,$$

which is the same as

$$K + \sum_{i=1}^{K} (K - i + 1)\, c_i,$$

since each term $c_i$ appears $K - i + 1$ times in the double sum. Let us perform a substitution of variables,
where we let $d_1, d_2, \ldots, d_K$ denote the cardinalities of the colors in Q in nonincreasing order, so $d_1$ is the
most frequent color and $d_K$ is the least frequent. Then we can rewrite the total number of queries performed
to be bounded by

$$K + \sum_{i=1}^{K} i\, d_i.$$

Note that, by definition, $d_i \le N/i$, for otherwise, $d_i$ could not be the ith largest-cardinality color. Thus, the
total number of queries is at most

$$K + \sum_{i=1}^{K} i\,(N/i) = K + KN = (N + 1)K.$$



This is the number of tests done by Bob, the Mastermind attacker, making no additional assumptions about 
the distribution of colors in the query string, Q. 

This analysis can be refined, however, if the colors are distributed in Q according to Zipf's Law [27],
which in this context would imply that

$$d_i \le \frac{N}{i^s H_{N,s}},$$

where $H_{N,s}$ is the N-th Harmonic number of order s,

$$H_{N,s} = \sum_{i=1}^{N} \frac{1}{i^s},$$

and s is between 1 and 2, inclusive. In this case, the total number of guesses done by Bob would be at most

$$K + \sum_{i=1}^{K} i \cdot \frac{N}{i^s H_{N,s}} \le K + \frac{KN}{H_{N,s}}$$

for s > 1. Thus, we have the following:



Theorem 2: Given an unknown length-N string Q, defined on an alphabet of size K, a malicious
Mastermind attacker can discover Q in polynomial time using (N + 1)K sequence-alignment tests against
Q, each of which reveals only the length of a longest common subsequence between Q and the test
string. If the cardinalities of elements of Q follow Zipf's Law, with parameter s > 1, then a malicious
Mastermind attacker can discover Q using at most $K + KN/H_{N,s}$ sequence-alignment tests.
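
To get a feel for the bounds in Theorem 2, the following Python sketch evaluates them numerically. It is only an illustration: the function names are assumptions of this example, and `frequency_bound` evaluates the intermediate expression $K + \sum_i i\, d_i$ from the analysis for one concrete string.

```python
from collections import Counter

def general_bound(n, k):
    """Worst-case query bound (N + 1) K."""
    return (n + 1) * k

def frequency_bound(q, alphabet):
    """K + sum_i i * d_i, with d_1 >= d_2 >= ... the color counts of q."""
    counts = Counter(q)
    d = sorted((counts.get(c, 0) for c in alphabet), reverse=True)
    return len(alphabet) + sum(i * d_i for i, d_i in enumerate(d, start=1))

def harmonic(n, s):
    """Generalized harmonic number H_{N,s} = sum_{i=1..N} 1 / i^s."""
    return sum(1.0 / i ** s for i in range(1, n + 1))

def zipf_bound(n, k, s):
    """Query bound K + K N / H_{N,s} under the Zipf assumption."""
    return k + k * n / harmonic(n, s)

if __name__ == "__main__":
    q = "ACGGATGCCTT"
    print(frequency_bound(q, "ACGT"), general_bound(len(q), 4))
    print(zipf_bound(len(q), 4, 1.5))
```

Because $H_{N,s} \ge 1$, the Zipf bound never exceeds the general bound, and the per-string expression is at most the general bound as well.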



5 Exploiting Data Distributions 

Up to this point, we have focused on how the Mastermind attacker, Bob, could learn a general string Q using
the types of queries typically asked of genomic databases, even if those queries are privacy preserving. In 
this section, we explore how Bob can significantly improve the effectiveness of the Mastermind attack if he 
exploits information, which is publicly available, about the distributions of the character strings of interest. 
Moreover, to drive the point home, we provide a case study showing the effectiveness of such Mastermind 
attacks on a real-world genomic database, in the section that follows. 

Genomic sequences typically have a great deal of similarity. Indeed, recent compression schemes have 
shown that it is effective to view a genomic sequence with respect to a compression scheme that represents a 
sequence in terms of its differences with a reference sequence, R (e.g., see [4]). That is, we can start from a
reference sequence, R, which contains the most common components of a typical genomic sequence. Then 
we define each other sequence, Q, in terms of its differences with R. Each difference is defined by an index 
location, i, in R and an operation to perform at that location, such as a substitution, insertion, or deletion. 
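
One way to picture this reference-based representation is the Python sketch below. The encoding is hypothetical and chosen only for illustration; it is not the compression format of the cited work. Each difference is a triple (index, operation, base), where the index refers to a position in R and the operation is 'sub', 'ins', or 'del'.

```python
def apply_differences(reference, diffs):
    """Rebuild a sequence from a reference and a list of (index, op, base) edits.

    'sub' replaces reference[index] with base, 'del' drops reference[index],
    and 'ins' inserts base just before reference[index]. An index equal to
    len(reference) is allowed for 'ins' and appends at the end.
    """
    by_index = {}
    for index, op, base in diffs:
        by_index.setdefault(index, []).append((op, base))
    out = []
    for index in range(len(reference) + 1):
        ops = by_index.get(index, [])
        for op, base in ops:
            if op == "ins":
                out.append(base)            # inserted bases come before position index
        if index < len(reference):
            kinds = {op: base for op, base in ops if op != "ins"}
            if "del" in kinds:
                continue                    # the reference base is dropped
            out.append(kinds.get("sub", reference[index]))
    return "".join(out)

if __name__ == "__main__":
    reference = "ACGT"
    diffs = [(1, "sub", "T"), (2, "del", None), (4, "ins", "A")]
    print(apply_differences(reference, diffs))
```

A sequence that is close to the reference needs only a short list of such triples, which is what makes the differences, rather than the full string, the natural target for an attacker.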
This difference pattern is present, for example, in human mitochondrial DNA, which is the type of
genomic data we use in our case study. This type of DNA is inherited only through the maternal line and,
as we have already mentioned, is already available in sequenced form in sizeable enough quantities to
support obfuscated Mastermind attacks. Moreover, because it is passed only through the maternal line, it
functions as a highly tuned notion of race, allowing researchers in some cases to trace a person's ancestry to
individual villages. Thus, mitochondrial DNA is highly sensitive from a privacy-protection viewpoint.

As shown in recent work of Baldi et al. [4], mitochondrial DNA sequences can be encoded in
significantly-compressed form by using a standard reference sequence [30]. This reference sequence, R = rCRS, is
16,568 bp long. So, in terms of the notation used above, we have N = 16568 and K = 4, since there are
4 types of base pairs possible. But these parameters suggest that there is more variation in the data than
actually occurs.

In fact, the vulnerability of DNA sequences to the Mastermind attack is much worse than this in practice. 
For example, there are a limited number of locations along the reference sequence where any changes 
appear statistically in the mitochondrial DNA data. So let us use M to denote the number of different 
possible locations where any query sequence might differ from the reference sequence, R. Worse yet, from 
a privacy-preservation standpoint, the average number of differences between any human DNA sequence and
the reference is orders of magnitude smaller than M in practice. (We explore these statistics in detail below.) 
Here we show how a Mastermind attack can exploit these statistical properties of genomic data. 

5.1 The Substitution-Only Case 

In this section, we explore the version of the Mastermind attack where the attacker, Bob, engages in a series
of privacy-preserving protocols with Alice, each of which reveals only the single-count straight-match score
between Alice's string, Q, and strings provided by Bob, in an iterative online fashion (recall Figure 2).
In the attack model we consider, Bob is allowed to use self-constructed sequences in comparisons with Q,
from which he learns the value of b(Q, V_i) for each of his query strings, V_i.

Given additional knowledge of the distributional properties of DNA data, we can construct a Mastermind
attack to take this knowledge into consideration. In this case, we make the assumption that the unknown
string, Q, differs from a reference string R only through a relatively small number of substitutions, which is
true, for example, for 45% of the mitochondrial DNA data. (We will explore the more general case later in
this section.)

Our algorithm is an adaptation of an algorithm of Goodrich [20] for solving the board-game version of
Mastermind to the specific case of a Mastermind attack on a string Q relative to a reference string R.

We begin the attack for Bob by having him perform a query against Q with a reference sequence, R. For
any string, Q, let s(Q) denote the number of substitutional differences Q has with the reference sequence,
R. Note, then, that our first query (for the reference string R itself) allows us to determine the value of s(Q),
using the formula

$$ s(Q) = N - b(Q, R). $$
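The following sketch illustrates the single-count score b and the first step of the attack. The secret string here is a hypothetical example; in the attack itself Bob never sees Q, and the value of b(Q, R) is what the privacy-preserving protocol returns to him.

```python
def single_count_score(q: str, v: str) -> int:
    """b(Q, V): number of positions where two equal-length strings agree."""
    if len(q) != len(v):
        raise ValueError("single-count scores compare equal-length strings")
    return sum(1 for a, b in zip(q, v) if a == b)


def substitution_count(score_against_reference: int, n: int) -> int:
    """s(Q) = N - b(Q, R), computed from the score Bob observes."""
    return n - score_against_reference


R = "GATCACAG"
Q = "GACCACTG"   # hypothetical secret: differs from R at positions 2 and 6
b = single_count_score(Q, R)          # in the attack, returned by the protocol
s = substitution_count(b, len(R))
```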

For example, R could be a genomic sequence derived from a sequencing of the DNA of a specific reference
human or it could be a canonical genomic reference sequence derived from analyzing commonalities
among a number of human sequences. Even though few humans have presently had their complete genomes
sequenced [11, 26, 38], any of these could serve as a reference, R, for a Mastermind attack on a complete
genome sequence. For the more widespread instances of mitochondrial DNA, the Revised Cambridge
Reference Sequence (rCRS) (GenBank accession number: AC_000021) is commonly used as a mtDNA
reference sequence [7, 8, 30], and it could serve as the sequence R in a Mastermind attack on a mitochondrial
DNA sequence.

Imagine that we cyclically order the K characters in our alphabet, so, for instance, if our alphabet is 
{A,C,G,T}, then we could use the cyclic ordering (A,C,G,T,A,C,G,T,...). Note that this ordering allows us
to choose any character as a base color, i.e., a "color 0," and then specify all other characters as offsets from 
that base. For example, in the DNA case, we could pick "C" as the base, color 0, in which case "G" becomes 
color 1, "T" becomes color 2, and "A" becomes color 3. Or we could pick "T" as the base, color 0, in which 
case "A" becomes color 1, "C" becomes color 2, and "G" becomes color 3. 
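The cyclic numbering can be sketched in a few lines of Python. The function names `offset_color` and `shift` are hypothetical helpers introduced here, and the alphabet order A, C, G, T is the one used in the example above.

```python
ALPHABET = "ACGT"  # cyclic ordering (A, C, G, T, A, C, G, T, ...)


def offset_color(base: str, ch: str) -> int:
    """Color of `ch` when `base` is taken as color 0."""
    return (ALPHABET.index(ch) - ALPHABET.index(base)) % len(ALPHABET)


def shift(base: str, offset: int) -> str:
    """Character that sits `offset` steps after `base` in the cyclic ordering."""
    return ALPHABET[(ALPHABET.index(base) + offset) % len(ALPHABET)]
```

With "C" as the base, `offset_color` gives 1 for "G", 2 for "T", and 3 for "A", matching the numbering described above.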

In the context of a Mastermind attack, we consider each character, R_i, in the reference sequence, R,
to be color "0" for that position, i. Viewed mathematically, we can then number the K - 1 remaining
characters, according to our cyclic ordering, as offsets from these respective color 0's. Assuming that Bob's
first guess, of R, is not a perfect match for the query sequence, Q, then we can view Bob's remaining task as
that of determining the cardinality and location of all the non-zero offset values for positions in R. In fact,
if we think of the characters in the respective positions of R as the respective color 0's for those positions,
then we can view the remaining task as that of determining the locations of the colors 1 through K - 1.

After Bob makes his initial guess using R, we then have him perform K - 1 additional queries, each of
which is a vector of elements that are all the same offset from R, i.e., a vector of all the same "colors" with
respect to R, but only at the M places that are statistically possible locations for a substitution. Thus, let us
assume we can view Q as now consisting of just the M places where substitutions may occur (for the other
locations we simply repeat a guess for color 0 every time). This allows us to initially know the cardinality,
c_0, c_1, ..., c_{K-1}, of every (offset) color in the (compressed) unknown vector, Q. If any c_i = 0, then we
remove the color i from our alphabet of colors, and update the value of K accordingly. The remainder of
Bob's computation proceeds as a recursive divide-and-conquer algorithm, which is similar in structure to
the approach of [10, 20].

The generic problem is to determine the offset values of all the elements in a range Q[l..r], which
initially is the entire vector Q = Q[0..N - 1], assuming we know the cardinalities, c_0, c_1, ..., c_{K-1}, of every
color in Q[l..r], and each c_i > 0. If K ≤ 1, we are done; so let us assume without loss of generality that
K ≥ 2. In addition, we assume inductively that we know d, the number of instances of color 0 outside of
the range Q[l..r]. Initially, of course, d = 0.

Given this initial setup, we split Q[l..r] into Q[l..m] and Q[m + 1..r], where m is in the middle of
the interval [l, r]. The main challenge, then, is to provide for Q[l..m] and Q[m + 1..r] the same setup
we had for Q[l..r]. This setup can be accomplished by determining the cardinalities, x_0, x_1, ..., x_{K-1}
and y_0, y_1, ..., y_{K-1}, of every color that respectively appears in Q[l..m] and Q[m + 1..r]. We do this
with a series of K - 1 additional queries, where we guess that the elements in Q[l..m] are of color i, for
i = 1, 2, ..., K - 1, and that the rest of Q is of color 0. Let the values of these queries be denoted as
b_1, b_2, ..., b_{K-1}, and note that, at this point, we know the following:

$$
\begin{aligned}
x_i + y_i &= c_i, && \text{for } i = 0, 1, \ldots, K-1 && (1) \\
x_i + y_0 &= b_i - d, && \text{for } i = 1, 2, \ldots, K-1 && (2) \\
x_0 + x_1 + \cdots + x_{K-1} &= m - l + 1. && && (3)
\end{aligned}
$$

Thus, we can determine y_0, as

$$ y_0 = \frac{c_0 + \sum_{i=1}^{K-1} (b_i - d) - (m - l + 1)}{K}, $$

for y_0 is counted K times in the sum of c_0 and all the (b_i - d)'s, and the sum of the x_i's is m - l + 1,
by Equation (3). Given the value of y_0, we can then determine all the x_i values, by using Equation (1) for
x_0 and Equation (2) for x_1, x_2, ..., x_{K-1}. Moreover, once we have all these x_i values, we can determine
the values, y_1, y_2, ..., y_{K-1}, using Equation (1). Finally, we can determine the values d' = d + y_0 and
d'' = d + x_0 and use these respectively for the role of d in Q[l..m] and Q[m + 1..r]. This gives us all the values
necessary to then recursively determine Q[l..m] and Q[m + 1..r]. Of course, if the c_i values for either of
these subproblems are all 0, except for one (which would be equal to the size of this problem), then there is
no need to recursively solve this problem; so we would not perform a recursive call in this case.
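The recursion above can be sketched as follows. This is an illustrative reading of the procedure, not the authors' implementation: the function name, the `score` callback (standing in for the protocol that returns b(Q, V)), and the `sites` list of the M candidate positions are assumptions, and for simplicity the sketch always issues all K - 1 queries per split rather than pruning colors whose count is zero.

```python
from typing import Callable, List

ALPHABET = "ACGT"
K = len(ALPHABET)


def shift(base: str, offset: int) -> str:
    """Character `offset` steps after `base` in the cyclic ordering."""
    return ALPHABET[(ALPHABET.index(base) + offset) % K]


def mastermind_substitution_attack(reference: str, sites: List[int],
                                   score: Callable[[str], int]) -> str:
    """Recover Q, assuming it differs from `reference` only by substitutions
    at positions listed in `sites`; `score(v)` must return b(Q, v)."""
    m_sites = len(sites)
    fixed = len(reference) - m_sites   # positions outside `sites` always match
    found = [0] * m_sites              # recovered offset colors at the sites

    def ask(guess: List[int]) -> int:
        """Score of an offset vector, restricted to the candidate sites."""
        v = list(reference)
        for j, off in enumerate(guess):
            v[sites[j]] = shift(reference[sites[j]], off)
        return score("".join(v)) - fixed

    def solve(l: int, r: int, c: List[int], d: int) -> None:
        size = r - l + 1
        for color in range(K):
            if c[color] == size:       # only one color present: no recursion
                found[l:r + 1] = [color] * size
                return
        m = (l + r) // 2
        b = [0] * K
        for i in range(1, K):          # guess color i on Q[l..m], color 0 elsewhere
            guess = [0] * m_sites
            guess[l:m + 1] = [i] * (m - l + 1)
            b[i] = ask(guess)
        y0 = (c[0] + sum(b[i] - d for i in range(1, K)) - (m - l + 1)) // K
        x = [c[0] - y0] + [b[i] - d - y0 for i in range(1, K)]   # Eqs. (1), (2)
        y = [c[i] - x[i] for i in range(K)]                      # Eq. (1)
        solve(l, m, x, d + y0)         # d' = d + y_0
        solve(m + 1, r, y, d + x[0])   # d'' = d + x_0

    if m_sites:
        counts = [ask([i] * m_sites) for i in range(K)]   # c_0, ..., c_{K-1}
        solve(0, m_sites - 1, counts, 0)

    q = list(reference)
    for j, off in enumerate(found):
        q[sites[j]] = shift(reference[sites[j]], off)
    return "".join(q)
```

A caller would pass a `score` function that wraps the comparison protocol; for testing, a closure over a known Q that counts matching positions plays that role.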

Let us, therefore, analyze the number of vector guesses performed by this algorithm. Ignoring for the
time being the initial set of K guesses, note that we only continue to search if we are guaranteed to be
homing in on a substitution. Thus, adding back the initial K guesses, we get that the total number of guesses
is at most

$$ s(Q) \lceil \log M \rceil + K. $$

Thus, we have the following. 

Theorem 3: Given an unknown length-N sequence Q, defined on an alphabet of size K, with Q having M
possible locations of deviation from a reference sequence, R, a malicious Mastermind attacker can discover
Q in polynomial time using s(Q) ⌈log M⌉ + K guesses, each of which reveals only the number of positions
where Q and the test sequence match, and where s(Q) denotes the number of substitutions that would
transform R into Q.
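To give a feel for the size of this bound, the sketch below evaluates it for illustrative inputs. Two assumptions are made here: the logarithm is taken base 2 (as the binary splitting suggests), and M is set to its largest possible value, N = 16568, because the number of candidate locations is not stated at this point. The value s = 28 is chosen to be close to the mean substitution count reported in Table 1.

```python
def ceil_log2(m: int) -> int:
    """Smallest integer t with 2**t >= m, for m >= 1 (avoids float rounding)."""
    return (m - 1).bit_length()


def substitution_attack_bound(s: int, m: int, k: int) -> int:
    """Upper bound s * ceil(log2 M) + K on the number of guesses."""
    return s * ceil_log2(m) + k


# Illustrative inputs only: s = 28 substitutions, M = N = 16568, K = 4.
bound = substitution_attack_bound(28, 16568, 4)   # 28 * 15 + 4 = 424
```

Even with this pessimistic choice of M, the bound is a few hundred guesses for a sequence of more than sixteen thousand characters.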

As we note in Section 6, this performance is more than adequate to show that nearly half of all
mitochondrial DNA data in our case study are vulnerable to this version of the Mastermind attack. Before we
provide those statistics, however, let us study how the Mastermind attack with sequence-alignment queries 
can be streamlined to exploit DNA data distributions. 

5.2 The Sequence-Alignment Case 

As mentioned above, roughly half of the sequences in the mitochondrial DNA data set include insertions 
and/or deletions in addition to substitutions in the reference sequence, R. Thus, we discuss in this subsection 
how we can modify the Mastermind attack algorithm of Section 4 to take advantage of the distributional
properties common in genomic data sets, so as to discover a query sequence that can have arbitrary kinds of 
differences with the reference sequence, R. In this case, we view differences with R procedurally as events, 
each of which is either a singleton deletion, or an arbitrary-length insertion, which would transform R into 
the query sequence, Q. (Note: for this algorithm, we view a substitution as actually occurring as a deletion 
event followed by an insertion event.) 

In this case, we run the attack algorithm in two phases. In Phase 1, we aim to discover all the deletion 
events, and in Phase 2, we aim to discover all the insertion events. In both phases, we make the simplifying 
assumption that insertion and deletion events are disjoint. That is, they don't overlap or interfere with one 
another. This assumption is based on the fact that these events come from a statistical characterization of 
genomic sequences, which is designed to keep events disjoint (for overlapping events are better subdivided 
further and considered as separate sub-events). So, for example, we assume that there is no insertion event 
that is then followed by a deletion event that then removes part of the sequence that was just inserted. 

We begin by performing a guess for the reference sequence, R. Armed with the sequence-alignment
score, a(Q, R), for R, we then perform a divide-and-conquer computation to find all the deletion events that
occur in going from R to Q. Note that if we next perform a guess V for a collection of deletion events at
some subset of the M statistically possible (deletion) locations in R, then we can detect how many deletions
actually occurred at these locations. Moreover, note that the insertion events don't change this score, since
the insertions and deletions do not interfere, by assumption. For each deletion event that is present in one of
the queried locations, our score will not change with respect to the score for R, and, for each location
that should not be deleted, we will record a score for V that is one worse than that for R. Thus, we can
determine the number of deletion events for any test we do by the difference between the score we observe
and the score we would expect if all of the deletions are removing actual matches. That is, if we test for r
singleton deletion events in V, then the number that actually occur is a(Q, V) - (a(Q, R) - r), where a is
the sequence-alignment score function.
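A small sketch of this counting step, taking the sequence-alignment score a to be the length of a longest common subsequence (LCS). The helper names are hypothetical, and the formula relies on the disjointness assumption stated above: when R contains repeated characters near a tested location, alternative alignments can exist, so this should be read as an illustration on a well-behaved example rather than a general guarantee.

```python
from typing import Iterable


def lcs_length(a: str, b: str) -> int:
    """Length of a longest common subsequence of a and b (row-by-row DP)."""
    prev = [0] * (len(b) + 1)
    for ch in a:
        cur = [0]
        for j, bj in enumerate(b, 1):
            if ch == bj:
                cur.append(prev[j - 1] + 1)
            else:
                cur.append(max(prev[j], cur[j - 1]))
        prev = cur
    return prev[-1]


def delete_positions(reference: str, positions: Iterable[int]) -> str:
    """Build the guess V by deleting the tested positions from R."""
    drop = set(positions)
    return "".join(ch for i, ch in enumerate(reference) if i not in drop)


def deletions_confirmed(score_r: int, score_v: int, r: int) -> int:
    """a(Q, V) - (a(Q, R) - r): how many of the r tested deletions occur in Q."""
    return score_v - (score_r - r)


R = "GATCAC"
Q = "GACAC"                       # hypothetical secret: R with position 2 deleted
score_r = lcs_length(Q, R)        # in the attack, returned by the protocol
V = delete_positions(R, [2, 4])   # test two candidate deletions
count = deletions_confirmed(score_r, lcs_length(Q, V), 2)
```

Here one of the two tested positions (position 2) is a true deletion, so `count` evaluates to 1.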

Let Z_{1,M} = {z_1, z_2, ..., z_M} be a set of Boolean variables, such that z_i is 1 if and only if the ith
statistically possible deletion event in R actually occurs in going from R to Q. We can perform a
divide-and-conquer search in Z_{1,M} to determine which of the z_i's are 1. We begin by testing for all the deletion
events in Z_{1,M}. This gives us the number of 1's in Z_{1,M}. We then perform a test for every deletion event
in Z_{1,M/2} = {z_1, ..., z_{M/2}}, which by deduction gives us the number in Z_{M/2+1,M} = {z_{M/2+1}, ..., z_M}.
We then recursively determine the number in either or both of these two sets so long as there is at least one
deletion event in that set. Thus, we perform a divide-and-conquer parallel "binary" search for each of the
exact locations of singleton deletions. Once we have completed this computation for R, with queries against
Q, we will have determined the locations of all the deletion events from R to Q, including those deletions
that are really substitution events. Thus, this set of guesses uses at most 1 + d(Q) ⌈log M⌉ tests, where d(Q)
is the number of (singleton) deletion events in going from R to Q.
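The search over candidate locations can be sketched independently of how each count is obtained. In the code below, `count_in_range(lo, hi)` is a hypothetical callback that returns how many events occur among candidates lo, ..., hi - 1 (0-indexed); in the attack, each call would correspond to one sequence-alignment test against Q.

```python
from typing import Callable, List


def locate_events(count_in_range: Callable[[int, int], int], m: int) -> List[int]:
    """Find the indices of all events among m candidates using counting tests."""
    found: List[int] = []

    def search(lo: int, hi: int, total: int) -> None:
        if total == 0:              # no events here: stop searching this range
            return
        if hi - lo == total:        # every candidate is an event (covers size 1)
            found.extend(range(lo, hi))
            return
        mid = (lo + hi) // 2
        left = count_in_range(lo, mid)      # one test for the left half
        search(lo, mid, left)
        search(mid, hi, total - left)       # right-half count follows by deduction

    search(0, m, count_in_range(0, m))      # initial test over all candidates
    return found
```

Only the left half of each split costs a test; the right half's count is deduced from the parent's total, which is what keeps the number of tests per located event logarithmic in M.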

Once we know the locations of all the deletions in going from R to Q, we perform a second set of binary
searches, just among these locations, to find the locations among this group that are actually the sites of 
substitution events. Let us now define R' to be the reference sequence resulting from performing the events 
we discovered in Phase 1. In particular, we perform a binary search for each of the K colors, with respect 
to R', searching, for each color i, in the statistically possible insertion locations in R' where we improve 
our score by adding a single character of color i. Note that there may be more than a single character of 
color i inserted at this location, but it is sufficient to do a single character query to determine that there is an 
insertion here, since there is a non-deleted element between every possible insertion location in R'. 

Since we continue to perform recursive binary-type searches for any insertion locations that actually
cause insertions, the number of additional guesses we do in this part of the second phase is at most
K + e(Q) ⌈log M⌉, where e(Q) is the number of insertion events.

At this point in the algorithm, we know where all the insertion events are located, but we don't know
the full extent of each of their sizes. So for each location, we perform a set of K guesses of length 2 to
see if we get a higher score by considering a longer insertion. If there are no differences from the singleton
queries, then we can infer the length of the insertion from the previous queries. Otherwise, we perform a
set of K guesses of length 3, 4, and so on, until we observe no change from the previous set of guesses.
Thus, with a total number of guesses equal to K ē(Q), where ē(Q) is the total size of all the insertion events,
we discover the length of each insertion event. To complete the computation, then, we perform a miniature
version of our algorithm from Section 4 at each location determined to be the site of an insertion event. Each
such computation requires (m + 1)K guesses, where m is the length of the insertion. Thus, the total number
of guesses made in this part of Phase 2 is (ē(Q) + 1)K. Therefore, we have the following.

Theorem 4: Given an unknown length-N sequence Q, defined on an alphabet of size K, with Q having M
possible locations of deviation from a reference sequence, R, a malicious Mastermind attacker can discover
Q in polynomial time using (d(Q) + e(Q)) ⌈log M⌉ + (ē(Q) + 2)K + 1 guesses, each of which reveals only
the number of positions where Q and the test sequence match, using sequence-alignment LCS tests, where

• d(Q) is the number of deletion events,

• e(Q) is the number of insertion events,

• ē(Q) is the total length of all insertion events.
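As a check on the bound, the per-phase counts given in the text can be added up. The notation \bar{e}(Q) for the total length of all insertion events is used here only to keep it distinct from e(Q), the number of insertion events; the grouping into three terms is our reading of the phases described above.

```latex
\begin{align*}
&\underbrace{1 + d(Q)\lceil \log M \rceil}_{\text{Phase 1: deletion events}}
 \;+\; \underbrace{K + e(Q)\lceil \log M \rceil}_{\text{Phase 2: insertion locations}}
 \;+\; \underbrace{(\bar{e}(Q) + 1)K}_{\text{Phase 2: insertion lengths and contents}} \\
&\qquad = \bigl(d(Q) + e(Q)\bigr)\lceil \log M \rceil + \bigl(\bar{e}(Q) + 2\bigr)K + 1 .
\end{align*}
```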

6 Case Study for Mitochondrial DNA 

We are at the point where hundreds of thousands of people have had their mitochondrial DNA (mtDNA)
sequenced [5, 29]; mtDNA is typically about 16,500 base pairs (bp) long, whereas the entire diploid human
genome is roughly 6 billion bp long. Interestingly, since mtDNA is transferred only along the maternal line,
scientists have used differences from a reference mtDNA sequence as a way to plot human migration from
the earliest days of the modern human species. (See Figure 4.)




Figure 4: A confluent illustration [13] of the pattern of human migration implied by mtDNA mutations
[5, 29]. Each letter stands for a major human mitochondrial haplogroup, that is, a canonical set of genetic
mutations from a common ancestor.

Because of this knowledge of migration patterns and its correlation to known mtDNA mutations, given 
someone's mtDNA sequence, it is possible to trace their maternal ancestry back to individual villages [5], 
just by identifying differences in their mtDNA to a reference sequence, e.g., rCRS (see Figure 5). In other
words, mtDNA alone is sufficient to determine a person's ethnic background with incredible accuracy. Thus, 
we are at a point where privacy is a real concern with respect to genomic sequences, and this concern is sure 
to increase in the future. 

In addition to ethnicity, there are, of course, other privacy concerns with respect to genomic data, in- 
cluding sensitive information related to disease susceptibility, and possible genetic influences on sexual 
orientation, personality, addiction, and intelligence. The possibility that employers or insurers will use genetic
information to screen those at high risk for a disease is already a public concern, and stories involving such
risks are widespread in the press. Indeed, the U.S. government and several states have already created laws
dealing with DNA data access, and many more are considering such legislation. Thus, there is a need for 
technologies that can safeguard the privacy and security of genomic data. 

Fortunately, several researchers have started exploring privacy-preserving data querying methods that 
can be applied to genomic sequences (e.g., see [2, 15, 16]). That is, cryptographic techniques can be used
to allow for queries to be performed in a way that answers the specific question — such as a score rating the 
quality of a query for DNA matching or sequence alignment — but does not reveal any other information 
about the data, such as race or disease risk of the individual whose DNA is being queried. 

GATCACAGGTCTATCACCCTATTAA
CCACTCACGGGAGCTCTCCATGCAT 
TTGGTATTTTCGTCTGGGGGGTATG 
CACGCGATAGCATTGCGAGACGCTG 
GAGCCGGAGCACCCTATGTCGCAGT 
ATCTGTCTTTGATTCCTGCCTCATC 

ATCTGGTTCCTACTTCAGGGTCATA 
AAGCCTAAATAGCCCACACGTTCCC 
CTTAAATAAGACATCACGATG 



Figure 5: A portion of the Revised Cambridge Reference Sequence, rCRS (GenBank accession number: 
AC_000021), which is 16,568 bp long.



The purpose of this case study is to show that, while being sufficient for single-shot comparisons of DNA 
sequences, such cryptographic techniques have a weakness when they are employed repeatedly. Specifically, 
we explore in this section how the Mastermind attack allows a genomic querier, Bob, to iteratively discover
the full identity of a genomic query sequence, Q, with surprising efficiency, even if each comparison of
Q with Bob's sequences is done using cryptographic privacy-preserving protocols. It is not surprising
that iterated privacy-preserving sequence comparisons leak some information about the sequences being 
compared; what is surprising is how quickly the Mastermind attack can work, especially on genomic data. 

To demonstrate the vulnerability of real-world DNA data to the Mastermind attack, we have performed a
case study of our distribution-based Mastermind attack algorithms. We used 1000 human mitochondrial
sequences downloaded from a recent version of GenBank (http://www.ncbi.nlm.nih.gov/Genbank/index.html).
We focused on the sequences alone, ignoring any header and other information, and have simulated
Mastermind attacks on each one. The Revised Cambridge Reference Sequence (rCRS) (GenBank accession
number: AC_000021) was also downloaded and used as the reference sequence [7, 8, 30]. The reference
sequence is 16,568 bp long. All the sequences were aligned to the reference sequence and, for each sequence,
the indices of the location of each variation were recorded together with the type (substitution, insertion,
deletion) and content of each variation. This step is also essential if one is interested in compressing the
data [4], for example. Statistics for the number of substitutions, deletions, and insertions for this data set of
1000 mtDNA sequences are given in Table 1.





                 mean     standard dev.
Substitutions    28.00    18.38
Deletions         0.90     2.46
Insertions        0.95     1.10


Table 1: Frequency statistics for 1000 mtDNA sequences. Mean and standard deviation statistics are given 
for the frequency of substitutions, deletions, and insertions in going from the reference sequence, R = rCRS, 
to each sampled sequence. 

Of the 1000 sequences, 453 have only substitution events with respect to the reference sequence, R =
rCRS. So we used this subset of 453 sequences to test the simulated performance of the method of
Theorem 3. The distribution of the number of substitutions in each of these sequences is shown in Figure 6.

Figure 6: Histogram of number of substitutions in 1000 mtDNA sequences with respect to the reference
sequence, R = rCRS.



Note that these frequencies do not follow a normal distribution, which shows the importance of our using 
real-world data, such as this, rather than randomly-generated or simulated data. The statistical diversity of 
the mtDNA data is actually a reflection of the racial diversity of the people whose mtDNA data is included 
in our data set. That is, edit distance from the reference sequence, R = rCRS, across the human species, is 
not uniformly or normally distributed. Instead, edit distance from rCRS is a reflection of human migration 
patterns, as illustrated in Figure 4.

The 45.3% of the sampled mtDNA sequences with substitution-only modifications from rCRS are
exactly the set of sequences that can be effectively discovered by the single-count Mastermind attack of
Theorem 3. Thus, we simulated the performance of this attack on each one of these sequences and tabulated
the number of guesses that would be needed in each case in order to discover the complete identity of each
sequence. Interestingly, 90% of the simulated substitution-only Mastermind attacks completed with 375
guesses or fewer. The complete distribution of single-count Mastermind attack lengths for this data set is
shown in Figure 7.

All 1000 sampled mtDNA sequences were then used to test the performance of the method of Theorem 4.
Sequence-alignment Mastermind attacks were simulated for each such mtDNA sequence while the number
of sequence-alignment tests was counted for each. Interestingly, 90% of these simulated
sequence-alignment Mastermind attacks completed with 875 guesses or fewer, and some completed with far fewer
than this. The complete distribution of sequence-alignment Mastermind attack lengths for this data set is
shown in Figure 8.

7 Discussion and Future Directions 

We have shown that, even though the single-count and sequence-alignment Mastermind satisfiability
problems are NP-complete, one can effectively mount Mastermind attacks on arbitrary genomic sequences just
by knowing basic information about the length of the sequences and the number of characters in the
alphabet used to construct those sequences. Moreover, if one has some basic statistical information about these
sequences, relative to a reference sequence, then one can mount the Mastermind attack with surprising
effectiveness. In fact, we provided a case study suggesting that such attacks are already possible and surprisingly
efficient for mtDNA sequences.

Figure 7: Histogram of Mastermind attack lengths for 453 substitution-only mtDNA sequences with
standard single-count Mastermind scores. The mean attack length for this data set was 219.6 and the standard
deviation was 139.1.

One conclusion to draw from this work is that privacy-preserving protocols for performing a query with 
a sequence, Q, against a genomic database, D, should take into account the entire set of comparisons [T?*], 
with Q and the sequences in D, rather than relying on the privacy-preservation of each individual comparison 
in turn. For example, in the usage model where Bob is a user querying a genomic database, the Mastermind 
attack is weakened if it is difficult for Bob to know the index of the sequences he is comparing against — for 
example, if the database owner, Alice, presents her sequences in a different random order each time. Such an 
obfuscation does not defeat the Mastermind attack, however, if Bob is able to use other reasoning inferences 
to match scores of his query sequences across multiple queries in Alice's database of sequences. 

In terms of further exploration of the vulnerability of genomic data to the Mastermind attack, one
interesting direction for future work would be to test the vulnerability of entire human genomes to the
Mastermind attack, once we have enough completed genomes to do such an experimental study. Other
directions for future research could include new, efficient privacy-preserving schemes for
querying entire genomic databases with respect to sequence-alignment queries. Such results would negate
the privacy-exposing vulnerabilities of the Mastermind attack.

Figure 8: Histogram of simulated Mastermind attack lengths for 1000 mtDNA sequences with
sequence-alignment scores. The mean sequence-alignment simulated Mastermind attack length was 536.3 with a
standard deviation of 373.9.